(54) A cache coherency mechanism



(57) A computer system has a plurality of processors, each for executing a sequence of instructions,
and at least one of the processors has a cache memory associated therewith. A memory provides an
address space of that processor where data items are stored for use by all of the processors. A
behaviour store holds, in association with the address of each item, a cache behaviour identifying
the cacheable behaviour of the item, the cacheable behaviours including a software coherent
behaviour and an automatically coherent behaviour. When a cache coherency operation is instigated
by a cache coherency instruction, the operation is effected dependent on the cacheable behaviour
of the specified address of the item.

A method of modifying the coherency status of a cache in this manner is also described.
Description

[0001] The present invention relates to a computer system and to a cache coherency mechanism therefor. 
[0002] As is well known in the art, cache memories are used in computer systems to decrease the access latency 
to certain data and code and to decrease the memory bandwidth used for that data and code. A cache memory can
delay, aggregate and reorder memory accesses. 

[0003] A cache memory operates between a processor and a main memory of a computer. Data and/or instructions
which are required by the process running on the processor can be held in the cache while that process runs. An
access to the cache is normally much quicker than an access to main memory. If the processor does not locate a
required data item or instruction item in the cache memory, it directly accesses main memory to retrieve it, and the
requested data or instruction item is loaded into the cache. There are various known systems for using and refilling
cache memories. 

[0004] A computer system may have more than one processor, and each processor may have its own cache.
Alternatively, a processor may have a plurality of CPUs, each with its own cache. However, these caches will commonly
access a single main memory resource.

[0005] Figure 1 illustrates a case where there are two processors (2), CPU1 and CPU2, each with its own cache (22),
CACHE1 and CACHE2. The caches share a single memory resource MEM (6). Figure 2 shows what can happen in such a
situation. Consider an address in main memory 1010. This maps onto cache location 10 in both CACHE1 and CACHE2.
The value V3 stored at address 1010 had an initial value of X, and the value V3 = X was initially stored at cache location
10 in both of the caches. At that stage, the data item V3 was "visible", that is either processor accessing address 1010
would retrieve from its cache the value V3 = X. However, CPU1 has executed a process, modified the value to V3 =
Y and returned this to location 10 in CACHE1. Now, the value V3 = X in main memory is "dirty" - it no longer reflects
the current value of V3. Moreover, the value V3 = X in CACHE2 is "stale" - it differs from the true value. Clearly, this
situation needs to be rectified before CPU2 attempts to retrieve V3, because otherwise it will wrongly retrieve V3 = X.

[0006] Thus cache coherency control is required to ensure that several processors and devices can correctly share
memory. This can be achieved by:

1. Automatic coherency. Additional hardware guarantees that loads can retrieve the most recently written value
regardless of which processor or device wrote it. Note that a functional, but low performance, implementation of
automatic coherency is to disable the cache. Such additional hardware COHERE is referenced 3 in Figure 1.

2. Software coherency. Special code sequences are used in the program to control the transfer of data between
cache and memory. They allow precise control of coherency and efficient use of the cache.

[0007] The visibility of data depends on whether the cache is automatically coherent or not. If the cache is not
automatically coherent then only the contents of memory and its own cache are visible to a processor. Software has to
cooperate to ensure that data is written to memory when appropriate. If the cache is automatically coherent then the
value most recently written by any processor will be visible to all other processors.

Visibility definitions.

[0008]

Visible   A data item is visible to a processor if a load from the data item's address will return that item.
Stale     A data item is stale if the value in the cache is different from the last value written.
Dirty     A data item is dirty if it has been modified in the cache with respect to main memory.
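
The definitions above can be illustrated with a minimal sketch of the Figure 2 scenario. The dictionary layout and the helper names (`write_local`, `is_dirty`, `is_stale`) are assumptions made for illustration only; they are not part of the described system, and the model assumes a store that updates only the local cache.

```python
# Illustrative model of the Figure 2 scenario (all names are hypothetical).
memory = {1010: "X"}                                   # main memory: V3 = X
cache1 = {10: {"addr": 1010, "value": "X"}}            # CACHE1, location 10
cache2 = {10: {"addr": 1010, "value": "X"}}            # CACHE2, location 10


def write_local(cache, row, value):
    """Store that updates only the local cache (no write to memory)."""
    cache[row]["value"] = value


def is_dirty(cache, row):
    """Dirty: the cached item has been modified with respect to main memory."""
    line = cache[row]
    return line["value"] != memory[line["addr"]]


def is_stale(cache, row, last_written):
    """Stale: the cached value differs from the last value written."""
    return cache[row]["value"] != last_written


# CPU1 modifies V3 from X to Y in its own cache only.
write_local(cache1, 10, "Y")
last_written = "Y"

cpu1_line_dirty = is_dirty(cache1, 10)                 # CACHE1 differs from memory
cpu2_line_stale = is_stale(cache2, 10, last_written)   # CACHE2 still holds X
```

In this sketch CPU2 would still read X from its own cache, which is the situation that the coherency mechanism has to rectify.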

[0009] In a situation where a process wishes to clear a location in the cache, but the process does not have access
to the address stored at that cache location, existing software coherency techniques require usage of a special,
privileged mode of processor operation termed kernel mode. In a normal user mode it is not possible in such a circumstance
to render the cache coherent using software coherency techniques other than by transfer into kernel mode.
[0010] Furthermore, contemporary processors, which have flexible cache coherency mechanisms, usually require
software to specify, either by a property of the page translation or by the execution of instructions, the extent to which
coherency will be actively managed by instruction sequences and the extent to which hardware will be responsible for
maintaining coherency. This leads to the problem that code written for one model will not provide coherency if
implemented on hardware with different coherency restrictions. For example, software written assuming a hardware coher-
[...] been provided.

[0011] According to one aspect of the present invention [...] a computer system comprising a plurality
of processors, each comprising an execution unit for executing a sequence of instructions, and at least one of the
processors having associated therewith a cache memory [...] of cache locations for holding items for
use by the processor; storage circuitry having addressable storage locations [...] address space of said at
least one processor in which items are stored for use by the processors; a behaviour store for holding in association
with an address of an item a cache behaviour identifying the cacheable behaviour of the item, wherein the cacheable
behaviours include a software coherent behaviour and an automatically coherent behaviour, and wherein the
instructions for execution by the execution unit include [...] specify an operation to be [...]
coherency unit for automatically implementing [...].
[0012] Where the computer comprises a cache coherency unit for automatically implementing coherency, the cache
coherency instructions effect a cache [...] coherent and software coherent behaviour
dependent on the nature of the cache coherency [...] has no cache coherency unit for
automatically implementing coherency, items having an automatically coherent behaviour [...] denied access to the
cache memory.

[0013] In the computer system, each of the processors can [...] respective cache memory associated therewith.

The storage circuitry can be a main memory accessible [...] a valid bit (which indicates whether or not [...]
memory).

[0015] The cacheable behaviours can include:

an unshared behaviour for items unique to one [...] said storage circuitry; and [...]
memory address space of the processors.

[0016] One type of cache coherency instruction is a flush instruction, which makes dirty items in the cache memory
associated with said at least one processor visible to the other [...] removes items from the cache memory.
[...] ensures that stale items are not [...]

[...] status of the contents of a cache in a computer system comprising a [...]
unit for executing a sequence of instructions and [...] therewith a cache
memory, and storage circuitry having addressable storage [...] space of said at least one
processor in which items are stored for use by the processors, [...] for each item a cacheable
behaviour of the item, wherein the cacheable behaviours [...] automatically coherent behaviour;
holding in a behaviour [...] item a cache behaviour
identifying the cacheable behaviour of the item; [...] cache coherency instructions which
each specify an operation to be executed on the contents of the cache memory and an address in
the storage circuitry; executing an operation on the contents of the [...] and effecting a cache
coherency operation to render the contents coherent [...] on the cacheable
behaviour of the specified address and whether [...]
implementing coherency.

[0020] The cache can be partitioned into a plurality of [...], wherein the cache partition containing the
relevant location in the cache is determined in dependence [...] memory. More details of
a particular cache partitioning implementation [...] 98M0518 2 [...].

[0021] The cache (or each cache partition) can be direct mapped. However, [...]

[0022] The main memory can be organised in pages, each page [...] addresses. In that case,
the cache coherency instruction can specify a page in main memory for which the operation is to be executed, the
operation being executed for each of the sequence of addresses in the specified page.
Cacheable behaviour can be page-related.

[0023] In the preferred embodiment, the processor has a user mode of operation and a privileged (kernel) mode of 
operation. Cache coherency instructions are executable in the user mode. 

[0024] The coherency mechanism described herein enables the writing of portable code which requires coherent 
shared memory whilst allowing performance optimisation of how the coherency is managed. 
[0025] For a better understanding of the present invention and to show how the same may be carried into effect, 
reference will now be made by way of example to the accompanying drawings in which: 

Figure 1 is a block diagram of automatic coherency control; 

Figure 2 illustrates "stale" and "dirty" data items;

Figure 3 is a block diagram of a computer incorporating a cache system; 

Figure 4 is a sketch illustrating a four way set associative cache; 

Figure 5 is a block diagram of the refill engine; 

Figure 6 illustrates an entry in the TLB; 

Figure 7 is a diagram illustrating one implementation of a cache coherency mechanism; and 
Figure 8 is a diagram illustrating a more complex implementation of a cache coherency mechanism. 

[0026] Prior to describing a cache coherency mechanism, there will first be described a cache architecture within 
which the mechanism can be implemented. 

[0027] Figure 3 is a block diagram of a computer incorporating a cache system. The computer comprises a CPU 2 
which is connected to an address bus 4 for accessing items from a main memory 6 and to a data bus 8 for returning 
items to the CPU 2. Although the data bus 8 is referred to herein as a data bus, it will be appreciated that this is for 
the return of items from the main memory 6, whether or not they constitute actual data or instructions for execution by 
the CPU. The system described herein is suitable for use on both instruction and data caches. As is known, there may 
be separate data and instruction caches, or the data and instruction cache may be combined. In the computer described 
herein, the addressing scheme is a so-called virtual addressing scheme. The address is split into a line in page address 
4a and a virtual page address 4b. The virtual page address 4b is supplied to a translation look-aside buffer (TLB) 10. 
The line in page address 4a is supplied to a look-up circuit 12. The translation look-aside buffer 10 supplies a real page 
address 14 converted from the virtual page address 4b to the look-up circuit 12. The look-up circuit 12 is connected
via address and data buses 16, 18 to a cache access circuit 20. Again, the data bus 18 can be for data items or
instructions from the main memory 6. The cache access circuit 20 is connected to a cache memory 22 via an address
bus 24, a data bus 26 and a control bus 28 which transfers replacement information for the cache memory 22. A refill
engine 30 is connected to the cache access circuit 20 via a refill bus 32 which transfers replacement information, data 
items (or instructions) and addresses between the refill engine and the cache access circuit. The refill engine 30 is 
itself connected to the main memory 6. 

[0028] The refill engine 30 receives from the translation look-aside buffer 10 a full real address 34, comprising the
real page address and line in page address of an item in the main memory 6. The refill engine 30 also receives a miss
signal on line 38 which is generated in the look-up circuit 12 in a manner which will be described more clearly hereinafter.
[0029] The cache memory 22 described herein is a direct mapped cache, although this is not necessary to implement 
the invention. That is, it has a plurality of addressable storage locations, each location constituting one row of the 
cache. Each row contains an item from main memory and part of the address in main memory of that item. Each row
is addressable by a row address which is constituted by a number of bits representing the least significant bits of the
address in main memory of the data items stored at that row. For example, for a cache memory having eight rows,
each row address would be three bits long to uniquely identify those rows. For example, the second row in the cache 
has a row address 001 and thus could hold any data items from main memory having an address in the main memory 
which ends in the bits 001. Clearly, in the main memory, there would be many such addresses and thus potentially 
many data items to be held at that row in the cache memory. Of course, the cache memory can hold only one data 
item at that row at any one time. 

[0030] The cache memory includes a tag RAM 23 (Figure 7) which holds for each row a tag identifying the page (by 
page address or some bits thereof) for the item held in that row in the cache. 
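
The row addressing described above can be sketched as follows for the eight-row example. The function names and the choice to treat all remaining upper address bits as the tag are simplifying assumptions for illustration; the text only requires that the tag identify the page by its page address or some bits thereof.

```python
# Minimal sketch of direct-mapped row addressing (names are hypothetical).
NUM_ROWS = 8                              # eight-row cache from the example
ROW_BITS = NUM_ROWS.bit_length() - 1      # three bits identify a row


def row_address(addr: int) -> int:
    """Row address: the least significant bits of the memory address."""
    return addr & (NUM_ROWS - 1)


def tag_of(addr: int) -> int:
    """Assumed tag: the remaining upper bits of the address."""
    return addr >> ROW_BITS


# Two different addresses ending in 001 compete for the same (second) row,
# so the stored tag is what tells them apart.
a, b = 0b1010_001, 0b0111_001
same_row = row_address(a) == row_address(b) == 0b001
different_tags = tag_of(a) != tag_of(b)
```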

[0031] To provide a cache system with greater flexibility, an n-way set associative cache memory has been developed.
An example of a 4-way set associative cache is illustrated in Figure 4. The cache memory is divided into four banks
B1, B2, B3, B4. The banks can be commonly addressed row-wise by a common row address, as illustrated schematically
for one row in Figure 4. However, that row contains four cache entries, one for each bank. The cache entry for bank
B1 is output on bus 26a, the cache entry for bank B2 is output on bus 26b, and so on for banks B3 and B4. Thus, this
allows four cache entries for one row address (or line in page address). Each time a row is addressed, four cache
entries are output and the real page numbers of their addresses are compared with the real page number supplied
from the translation look-aside buffer 10 to determine which entry is the correct one. If there is a cache [...]
attempted access to the cache, the refill engine 30 retrieves the requested item from the main memory 6 [...]
into the correct row in one of the banks, in accordance with a refill algorithm which is based on [...]
a particular item has been held in the cache, or other program parameters of the system. Such replacement algorithms
are known and are not described further herein.
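
A rough sketch of the 4-way look-up described above is given below. The list-of-banks data structure and the names `lookup` and `refill` are assumptions for illustration; the choice of victim bank is passed in explicitly because the replacement algorithm is not described here.

```python
# Minimal sketch of a 4-way set associative look-up (names are hypothetical).
NUM_BANKS = 4      # banks B1..B4
NUM_ROWS = 8       # assumed number of rows per bank

# banks[b][row] holds (real_page_number, item) or None when empty.
banks = [[None] * NUM_ROWS for _ in range(NUM_BANKS)]


def lookup(row, real_page):
    """Compare the real page number of all four entries of the addressed row."""
    for bank in banks:
        entry = bank[row]
        if entry is not None and entry[0] == real_page:
            return entry[1]          # hit: this bank holds the correct entry
    return None                      # miss: the refill engine must fetch it


def refill(row, real_page, item, victim_bank):
    """Load an item into the addressed row of the bank chosen for replacement."""
    banks[victim_bank][row] = (real_page, item)
```

With this arrangement up to four items sharing one row address can be resident at once, where the direct-mapped cache could hold only one.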

[0032] Basic operation of the computer [...] the cacheable
behaviours of different pages did not exist. The CPU 2 requests an item from [...]
memory and transmits that address on address bus 4. The virtual page number is supplied to the translation look-aside
buffer 10, which translates it into a real page number 14 according to a predetermined virtual-to-real page
translation algorithm. The real page number 14 is supplied to the look-up circuit [...]
the look-up circuit indicates a hit and [...]
held at that row of the cache memory to be returned to the CPU along data bus 8. [...] the real page number of
the address which was held at the addressed row in the cache memory 22 does [...]
supplied from the translation look-aside buffer 10, then a miss signal is generated on line 39 to the [...]. It
is the task of the refill engine 30 to retrieve the correct item from the main memory 6, using the real address which is
[...] row address matching the [...] least significant bits [...] line in page [...] in Figure 5 [...]
from the main memory is to be located.

[0034] Some possible variations on the above described embodiment are mentioned below.
[0035] In the described embodiment above, the address issued by the CPU [...]
page number 4b and a line in page 4a. However, the entire virtual address could be sent from the CPU to the look-up
circuit for the cache. Conversely, the CPU could issue real addresses directly to the look-up circuit.

[0036] In the embodiment described above, a single cache access circuit 20 [...]
on look-up and refill. However, it is also possible to provide the cache with an additional access port for refill, so that
look-up and refill take place via different access ports for the cache memory 22.

[0037] In the described embodiment, the refill engine 30 and cache access circuit 20 [...]
However, it would be quite possible to combine their functions into a single cache access circuit which performs both
[...]
data or instruction) from that address in main memory. It is not necessary for the whole of [...]
held at the cache location. For example the most significant bits of the address would [...]
a tag for that cache entry and is held in the tag RAM 23. This is known in the art and is [...]
[0039] The principles of the invention will now be described in more detail with reference to [...].
The cache coherency mechanism revolves around the data cache instructions flush, purge and validate, and how they
act on the cache and memory. The action of each of these instructions is dependent on the [...]
the data resides and whether the implementation has hardware to assist in the maintenance of [...]
[0040] The developer of the software targets [...]
by dint of performance or simplicity. The two cacheable page types which [...] software
coherent and automatically coherent. Here, software written which uses either (or both) page types will work on a wide
variety of implementations without requiring the provision of coherency hardware.

[0041] For data resident on pages of type automatically coherent, a software program does not include any
coherency instructions (flush, purge, validate) to maintain coherency - the implementation guarantees this automatically. For
data resident on pages of type software coherent, software takes full responsibility for keeping data coherent in the
system. It does this by issuing coherency instructions to establish coherency at appropriate points in the software.
[0042] The key advantages of this are that: 

1. It does not require the implementation to provide coherency hardware to support automatic coherency. Programs
will function on all suitable platforms.

2. Software coherent and automatically coherent pages may be mixed freely. 

3. Implementations can use special hardware to expedite coherency operations used for data resident on software 
coherent pages. 

4. Although they are not strictly necessary, coherency instructions may be issued to addresses which are on pages 
of type automatically coherent with impunity. 

[0043] There is a page type known as unshared which is designed to contain data which is private to a single CPU.
It is normally implemented write-back. In this case the coherency instructions have simple semantics as described in
Table I.



Table I

Instruction   Action
flush         writeback to memory if dirty (i.e. modified)
purge         writeback to memory if dirty. Invalidate
validate      Invalidate line if not dirty.
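
The semantics of Table I can be sketched as a small function operating on one cache line of an unshared, write-back page. The line representation (a dictionary with `addr`, `value`, `dirty` and `valid` fields) and the function name are assumptions for illustration only.

```python
# Minimal sketch of Table I for an unshared, write-back page (hypothetical names).
def coherency_unshared(op, line, memory):
    """Apply flush, purge or validate to a single cache line."""
    if op in ("flush", "purge") and line["valid"] and line["dirty"]:
        memory[line["addr"]] = line["value"]   # write back to memory if dirty
        line["dirty"] = False
    if op == "purge":
        line["valid"] = False                  # purge always invalidates
    elif op == "validate" and not line["dirty"]:
        line["valid"] = False                  # validate spares dirty lines


memory = {0x40: "old"}
line = {"addr": 0x40, "value": "new", "dirty": True, "valid": True}
coherency_unshared("validate", line, memory)   # dirty line: left untouched
coherency_unshared("flush", line, memory)      # memory now holds "new"
```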



[0044] Because data on unshared pages are used by only one user, there are no coherency implications and this page
type is not described further.

[0045] This permits efficient libraries of software to be written without knowing the coherency implications of data on 
which the software routines in the library operate. If the software is written with coherency instructions then the library 
will function on any of the three cached data types - with varying degrees of efficiency.

[0046] Three exemplary implementations follow which further illustrate the invention. Note that in each of the following
the coherency unit has conventional functionality and design. The novelty lies in the page type usage specified in the 
TLB and the way the function of the instructions is modified by it. 

Example 1 

[0047] In the simplest implementation, data on pages of type software coherent is cached using a write-through 
procedure and data on pages of type automatically coherent is prevented from residing in the cache. 

Operation of the TLB 

[0048] As described earlier, the TLB 10 is an associative store comprising a number of entries, one for each page, 
containing the virtual address VP and the associated physical address RP. A TLB entry is shown in Figure 6. As can
be seen, each entry contains three bits (denoted CB) which indicate the cache behaviour page type. The operation of
the cache system and coherency instructions is governed by the contents of these bits. The names of the page types
are given in Table II.



Table II

Name                     CB value
Unshared
Software Coherent
Automatically Coherent
Uncached                 3
Device                   4
Reserved                 5
Reserved                 6
Reserved



[0049] All memory and coherency instructions are implemented by the operation of first visiting the TLB then per- 
forming an action dependent, among other things, on the value in the CB field of the matching TLB entry. For the 
purposes of this disclosure the TLB is otherwise conventional. 
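
The "visit the TLB first, then act on the CB field" step can be sketched as follows. The class and function names are hypothetical. The CB encodings 3 and 4 are taken from Table II; the encodings 0, 1 and 2 used for the first three page types are an assumption made only so that the sketch is runnable.

```python
# Minimal sketch of a TLB entry carrying a CB field (names are hypothetical).
from dataclasses import dataclass
from enum import IntEnum


class CB(IntEnum):
    UNSHARED = 0                 # assumed encoding, not given in the text
    SOFTWARE_COHERENT = 1        # assumed encoding, not given in the text
    AUTOMATICALLY_COHERENT = 2   # assumed encoding, not given in the text
    UNCACHED = 3                 # from Table II
    DEVICE = 4                   # from Table II


@dataclass
class TLBEntry:
    vp: int      # virtual page address VP
    rp: int      # associated real (physical) page address RP
    cb: CB       # three-bit cache behaviour field


def visit_tlb(tlb, vp):
    """Return the real page and cache behaviour of the matching TLB entry."""
    for entry in tlb:
        if entry.vp == vp:
            return entry.rp, entry.cb
    raise KeyError(f"no TLB entry for virtual page {vp:#x}")


tlb = [TLBEntry(vp=0x12, rp=0x80, cb=CB.SOFTWARE_COHERENT),
       TLBEntry(vp=0x13, rp=0x81, cb=CB.AUTOMATICALLY_COHERENT)]
rp, behaviour = visit_tlb(tlb, 0x13)
```

Every load, store and coherency instruction would then branch on `behaviour` to select the action for that page type.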

Actions of load and store instructions 

[0050] The operation of load and store instructions on data resident in the shared cache page types is summarised 
in Table III. 



Table III

Instruction   Software Coherent                          Automatically Coherent
load          if in cache, fetch from cache;             fetch operand from memory
              if not in cache, fetch line from
              memory to cache
store         if in cache, update cached copy and        update memory
              memory;
              if not in cache, update memory


Actions of Coherency instructions 

[0051] The action of flush, purge and validate instructions for this simplest write-through implementation is described
in Table IV.



Table IV

              Software Coherent   Automatically Coherent
Flush         No-Op               No-Op
Purge         Invalidate          No-Op
Validate      Invalidate          No-Op


Example 2 

[0052] A scheme suitable for providing automatically coherent data will now be described with reference to Figures 
1 and 7. In this implementation, data on pages of type software coherent are cached using a write-through procedure 
and data on pages of type automatically coherent may be resident in the cache and are implemented using a policy
of write-allocate, write through with "snooped" invalidates. 

[0053] Figure 7 illustrates a tag RAM 23 which forms part of the cache 22. The tag RAM holds address tags associated 
with items in the cache. In addition, there is a dirty bit DB and a valid bit VB held for each location in the cache 22. The
coherency unit 3 is similar to that referenced 3 in Figure 1 and allows automatic coherency to be implemented. 
Actions of load and store instructions

Automatically coherent pages

[0054] For data which is resident in the cache 22 and on a page of type automatically coherent, a load instruction
will service the request from the cache. A store instruction is implemented as write-through, so that all stores are
immediately updated in the cache and main memory. Additionally, all other coherency units on the system bus perform
a look-up based on the address LUA (look-up address) which accompanies the store. This is called snooping. If the
address matches an address in the tag RAM 23 of the cache 22 then the coherency logic asserts a signal IV (invalidate
signal) which results in the valid bit VB being negated. This invalidates matching cache lines in every other cache in
the system. For data which is not resident in the cache, and resident on a page of type automatically coherent, a read
request simply fetches the line into the cache with no other effects. A write request WR to an address ADDR in similar
circumstances first of all reads the data into the cache by transferring a line from main memory into a location in the
cache, then updates the cache copy by modifying the bytes in the correct location only. The value is written through to
memory and the coherency unit ensures that all other cached copies are invalidated.

Software coherent pages 

[0055] For data which is resident in the cache and on a page of type software coherent, a load instruction will service
the request from the cache. If the data is not in the cache it is fetched from main memory. A store request updates
main memory, and also updates the cache copy if the data is in the cache. If the store request is to
an address not in the cache, only the main memory version is updated. The data is not fetched into the cache.
[0056] The operation of load and store instructions on data resident in the shared cache page types is summarised
in Table V.

Table V

Instruction   Software Coherent                          Automatically Coherent
load          if in cache, fetch from cache;             if in cache, fetch from cache;
              if not in cache, fetch line from memory    if not in cache, fetch line from memory
store         if in cache, update cached copy and        if in cache, update cached copy and memory,
              memory;                                    then invalidate all other copies;
              if not in cache, update memory             if not in cache, fetch into cache from memory,
                                                         update memory, then invalidate all other
                                                         cached copies



Actions of Coherency instructions 

[0057] The action of flush, purge and validate instructions for this example is described in Table VI.



Table VI

              Software Coherent   Automatically Coherent
Flush         No-Op               No-Op
Purge         Invalidate          Invalidate
Validate      Invalidate          No-Op



[0058] Figure 7 illustrates the operation of coherency hardware which can be used to invalidate caches on write.
When a write occurs on the system bus the WriteNotRead line WR and the write address ADDR are latched by all
caches in the system which are not attached to the processor performing the write. Each coherency unit causes a
look-up to occur on the write address using the look-up signal LUA. If a hit occurs (i.e. the write address is equivalent to
the Address Tag and the Valid bit is set), as indicated by the match signal MS, the coherency unit causes the valid
bit on the matched cache line to be cleared using the Invalidate signal. This invalidates the cache line. If the look-up
results in a miss, i.e. no matching cache tag is found, then nothing further happens for that cache. For load instructions
which miss in the cache the data is fetched from memory.
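
The snooped-invalidate behaviour for automatically coherent pages can be sketched in software as below. The `Bus` and `SnoopingCache` classes are hypothetical stand-ins for the system bus and the coherency units; presence of an address in `lines` plays the role of a matching tag with the valid bit set, and the model works per address rather than per line.

```python
# Minimal sketch of write-through with snooped invalidates (hypothetical names).
class Bus:
    def __init__(self, memory):
        self.memory = memory
        self.caches = []

    def write(self, writer, addr, value):
        self.memory[addr] = value            # write-through to main memory
        for cache in self.caches:
            if cache is not writer:          # every other cache snoops the write
                cache.snoop_write(addr)


class SnoopingCache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}                      # address -> value (valid entries)
        bus.caches.append(self)

    def load(self, addr):
        if addr not in self.lines:           # miss: fetch from memory
            self.lines[addr] = self.bus.memory[addr]
        return self.lines[addr]

    def store(self, addr, value):
        self.load(addr)                      # write-allocate on a miss
        self.lines[addr] = value
        self.bus.write(self, addr, value)

    def snoop_write(self, addr):
        # Look-up on the snooped address; a match clears the valid bit.
        self.lines.pop(addr, None)


memory = {0x10: "X"}
bus = Bus(memory)
c1, c2 = SnoopingCache(bus), SnoopingCache(bus)
c2.load(0x10)                # CACHE2 holds X
c1.store(0x10, "Y")          # CPU1 writes; CACHE2's copy is invalidated
reloaded = c2.load(0x10)     # CPU2 misses and refetches the new value
```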
Example 3

[0059] More complex hardware suitable for providing automatically coherent data may be used. In this
implementation, data on pages of type software coherent and automatically coherent are implemented identically. They use a
write-allocate, write-back with "snarfed" updates to other caches. The scheme is illustrated in Figures 1 and 8.

[0060] Figure 8 illustrates the tag RAM 23 and a coherency unit 3". In addition, a data RAM 25 which forms part of 
the cache 22 is also illustrated. The data RAM 25 has a plurality of lines corresponding to lines in the tag RAM 23 for
holding data items. 

Actions of load and store instructions

Load

[0061] For data which is resident in the cache and on a page of type automatically coherent, or software coherent,
a load instruction will service the request from the cache as normal. If the requested data is not in the cache then the
coherent hardware will attempt to locate the data (by its address) in any of the other caches in the system. If this attempt
fails it will fetch the data from external memory.

Store

[0062] A store to data resident in the cache will cause the update of the copy held in the cache and additionally
update all other caches which already hold a copy of the data. The mechanism for this is as follows. When a store
occurs which hits in the cache, the cached copy is updated. Concurrently, the new value is copied onto the system
bus and all other coherency units on the system bus perform a look-up based on the address which accompanies the
store. If the address matches an address in the tag RAM then the coherency logic asserts a signal which results in the
cache updating the line with the data on the bus. This is a conventional technique sometimes referred to as snarfing.
For data which is not present in the cache when a store is attempted, the data is first fetched as described earlier for
a load, and then the write takes place on the resident line.

[0063] The operation of load and store instructions on data resident in the shared cache page types is summarised
in Table VII.



Table VII

Instruction   Automatically Coherent / Software Coherent
load          if in cache, fetch from cache;
              if not in cache, fetch line from another CPU's cache; if this fails, fetch from memory
store         if in cache, update cached copy then update all other cached copies;
              if not in cache, fetch from another CPU's cache; if this fails, fetch from memory. Update
              cache with store then update all other cached copies.



Actions of coherency instructions 


[0064] The action of coherency instructions for a more complex write-back implementation is described in Table VIII. 



Table VIII

              Automatically Coherent / Software Coherent
Flush         No-Op
Purge         Write-back to memory if dirty then invalidate
Validate      No-Op



[0065] Figure 8 illustrates the action of coherency hardware which can be used to update caches on write or take a 
copy of data on a read. 
Coherency unit Store Operation



[0066] When a store occurs on the system bus the WriteNotRead line WR and the address ADDR are latched by
all caches in the system which are not attached to the processor performing the store. Each such coherency unit causes
a look-up to occur on the store address by supplying the address to the tag RAM 23 as the look-up address LUA. If
a hit occurs, as indicated by the match signal MS, the coherency unit causes write data (also called update data) to
be copied into the cache using the UpdateNotCopy signal UNC. This updates the cache line. If the look-up results
in a miss, i.e. no matching cache tag is found, then nothing further happens for that cache.

Coherency unit Load Operation 

[0067] When a load occurs on the system bus, the WriteNotRead line WR and the address ADDR are latched by
all caches in the system which are not attached to the processor performing the load. Each such coherency unit causes
a look-up to occur on the load address by supplying it as the look-up address LUA. If a hit occurs, as indicated by
the match signal MS, the coherency unit causes the cached copy in the data RAM 25 to be copied onto the system
bus using the copy data and read data lines and using the UpdateNotCopy signal UNC. When the data is placed
on the system bus, the processor performing the load will copy it into its cache. If several caches have
copies of this data, a system bus arbiter (not shown) selects one (they are all the same) and uses it to drive the bus. If
the look-up results in a miss, i.e. no matching cache tag is found, then nothing further happens for that cache.
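The load operation can be sketched in the same spirit. The function snoop_load is a hypothetical name, each cache is modelled as a plain dictionary, and the arbiter policy (take the first cache that hits) is an assumption, since the specification only says that one of the identical copies is selected.

```python
def snoop_load(caches, reader, addr):
    """Return the data supplied by another cache, or None if every look-up misses."""
    hits = [c for c in caches if c["owner"] != reader and addr in c["lines"]]
    if not hits:
        return None                      # no cache drives the bus
    supplier = hits[0]                   # assumed arbiter policy; all copies are identical
    data = supplier["lines"][addr]
    for c in caches:
        if c["owner"] == reader:
            c["lines"][addr] = data      # the loading processor copies the data into its cache
    return data


caches = [
    {"owner": 0, "lines": {}},
    {"owner": 1, "lines": {0x200: 42}},
    {"owner": 2, "lines": {0x200: 42}},
]
supplied = snoop_load(caches, reader=0, addr=0x200)
missed = snoop_load(caches, reader=0, addr=0x300)
```

In the miss case the sketch returns None; according to the table of store and load behaviour, the data would then be fetched from memory.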
[0068] The present cache coherency mechanism provides the following instructions. In these instructions, the de- 
notation dmem refers to the main memory of the computer system. 



Flush 



[0069] These instructions are provided to make certain that dirty data is visible to other users. That is, the item held 
at the relevant cache location is written back to the address in main memory held at that cache location with the item. 



flushline       flush a line          base, offset
                                      unsigned(x) unsigned(x)

Ensure that all previous writes to the line containing dmem[base+offset] are visible to other users sharing this data.

flushpart       flush a partition     base, offset
                                      unsigned(x) unsigned(x)

Flush a dirty cache line which could be replaced by a memory access to dmem[base+offset].
Flushing a line ensures that all previous writes to that line are visible to other users sharing this data. If all replaceable
lines are clean the instruction has no effect.
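The difference between the two flush instructions can be sketched as follows. This is an illustrative model only: the line size of 32 bytes, the number of partitions, the mapping of an address to a partition, and the helper names Line, partition_of and write_back are all assumptions made for the example. The functions flushline and flushpart are software stand-ins for the instructions of the same name.

```python
LINE_BYTES = 32   # assumed line size
NUM_SETS = 4      # assumed number of partitions


class Line:
    def __init__(self, base, data, dirty):
        self.base = base      # address of the first byte of the line
        self.data = data
        self.dirty = dirty


def partition_of(addr):
    """Assumed mapping from an address to the partition whose lines it could replace."""
    return (addr // LINE_BYTES) % NUM_SETS


def write_back(line, dmem):
    dmem[line.base] = line.data
    line.dirty = False        # the line stays in the cache, now clean


def flushline(cache, dmem, base, offset):
    """Write back the line containing dmem[base+offset] if it is cached and dirty."""
    addr = base + offset
    line_start = addr - addr % LINE_BYTES
    for line in cache[partition_of(addr)]:
        if line.base == line_start and line.dirty:
            write_back(line, dmem)


def flushpart(cache, dmem, base, offset):
    """Write back one dirty line that an access to dmem[base+offset] could replace."""
    for line in cache[partition_of(base + offset)]:
        if line.dirty:
            write_back(line, dmem)
            return True
    return False              # all replaceable lines are clean: no effect


cache = [[] for _ in range(NUM_SETS)]
cache[0] = [Line(0, "A", True), Line(128, "B", True)]
dmem = {}
flushline(cache, dmem, 0, 8)               # flushes the line holding address 8
first = flushpart(cache, dmem, 256, 0)     # address 256 maps to partition 0
second = flushpart(cache, dmem, 256, 0)    # nothing dirty remains
```

Note that flushpart does not require the addressed item itself to be cached; it acts on whichever dirty line the access could displace.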



Purge [for data items]

[0070] These instructions are provided to remove data from the cache - they write back data items in the cache to 
addresses in main memory specified with those items, then invalidate the cache contents. 



purgeline       purge a line          base, offset
                                      unsigned(x) unsigned(x)

Write back to memory any dirty items in the line containing dmem[base+offset] and invalidate the line in all cases.



purgepart       purge a partition     base, offset
                                      unsigned(x) unsigned(x)

Purge a valid cache line which could be replaced by a memory access to dmem[base+offset].
A line is purged by writing to memory any dirty data it contains and then invalidating the line.
If all replaceable lines are invalid the instruction has no effect.
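A purge of a single line can be sketched as follows. The model is hypothetical: the cache is a dictionary keyed by the line's starting address, the 32-byte line size is an assumption, and purgeline here is a software stand-in for the instruction of the same name.

```python
LINE_BYTES = 32  # assumed line size


def purgeline(lines, dmem, base, offset):
    """Write back the line containing dmem[base+offset] if dirty, then invalidate it.

    Returns True if a valid line was present and has been removed.
    """
    addr = base + offset
    line_start = addr - addr % LINE_BYTES
    line = lines.pop(line_start, None)        # invalidate in all cases
    if line is not None and line["dirty"]:
        dmem[line_start] = line["data"]       # write back dirty data first
    return line is not None


lines = {
    64: {"data": "new", "dirty": True},
    96: {"data": "same", "dirty": False},
}
dmem = {64: "old", 96: "same"}
purged_dirty = purgeline(lines, dmem, 64, 4)
purged_clean = purgeline(lines, dmem, 96, 0)
purged_absent = purgeline(lines, dmem, 512, 0)
```

Unlike a flush, the purged line no longer occupies the cache afterwards, whether or not it was dirty.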



[0071] In another embodiment, the partition-based flush instruction can have the following form. In the following
instruction, var<a:b> is bits a to b of the variable var.



flushpart       flush a partition     base, offset
                                      unsigned(x) unsigned(x)

addr ← base + offset
addr$<12:63> ← addr<12:63>
For index=0; index<4096; index+=32
    addr$<0:11> = index<0:11>

Flush a plurality of dirty cache lines which could be replaced by a memory access to dmem[addr$].



[0072] The purge instruction can take a similar form where a single instruction is to operate on a plurality of lines in 
the cache. 
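The address sequence generated by the loop in the preceding form of flushpart can be illustrated with a short sketch. The function name flushpart_addresses is hypothetical, and the sketch assumes that bit 0 is the least significant bit, that addresses are 64 bits wide, and that the constants 4096 and 32 correspond to the partition size and line size in bytes.

```python
def flushpart_addresses(base, offset, partition_bytes=4096, line_bytes=32):
    """List the addresses visited by the loop: bits 12..63 are taken from
    base + offset and bits 0..11 step through the partition one line at a time."""
    addr = (base + offset) & (2**64 - 1)        # assumed 64-bit address
    upper = addr & ~(partition_bytes - 1)       # addr<12:63>
    return [upper | index for index in range(0, partition_bytes, line_bytes)]


addrs = flushpart_addresses(0x12345000, 0x678)
```

With these assumed constants a single instruction covers 128 line addresses, from 0x12345000 to 0x12345FE0 in this example.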


Claims 

1. A computer system comprising a plurality of processors, each comprising an execution unit for executing a
sequence of instructions and at least one of the processors having associated therewith a cache memory having a
plurality of cache locations for holding items for use by the processor,

storage circuitry having addressable storage locations in the memory address space of said at least one
processor in which items are stored for use by the processors;

a behaviour store for holding in association with an address of an item a cache behaviour identifying the
cacheable behaviour of the item, wherein the cacheable behaviours include a software coherent behaviour
and an automatically coherent behaviour and wherein the instructions for execution by the execution unit
include cache coherency instructions which each specify an operation to be executed on the contents of a
cache location and an address in the storage circuitry;

each processor being operable responsive to the cache coherency instructions to execute an operation on
the contents of the specified cache location and to effect a cache coherency operation to render the contents
coherent with the storage circuitry in a manner dependent on the cacheable behaviour of the specified address
and whether or not the processor contains a cache coherency unit for automatically implementing coherency.



2. A computer system according to claim 1, which comprises a cache coherency unit for automatically implementing
coherency and wherein said cache coherency instructions effect a cache coherency operation for automatically 
coherent and software coherent behaviour dependent on the nature of the cache coherency unit. 

3. A computer system according to claim 1, which has no cache coherency unit for automatically implementing
coherency, wherein items having an automatically coherent behaviour are denied access to the cache memory.

4. A computer system according to any preceding claim, wherein each of the processors has a respective cache 
memory associated therewith. 

5. A computer system according to any preceding claim, wherein the storage circuitry comprises a main memory
accessible by all of the processors. 
6. A computer system according to any preceding claim, wherein each item in the or each cache memory is stored
in association with a valid bit which indicates whether or not that item is valid. 

7. A computer system according to any preceding claim, wherein each item in the or each cache memory is stored
in association with a dirty bit which indicates whether or not that item has been modified with respect to the storage 
circuitry. 

8. A computer system according to any preceding claim, wherein the cacheable behaviours include an unshared
behaviour for items unique to one of said processors. 

9. A computer system according to any preceding claim, wherein the cacheable behaviours include an uncacheable 
behaviour for accesses to memory devices other than said storage circuitry. 

10. A computer system according to any preceding claim, wherein the cacheable behaviours include a device behav- 
iour for accesses to devices other than memory devices, said devices being addressable in the memory address 
space of the processors. 

11. A computer system according to any preceding claim, wherein the cache coherency instructions include a flush 
instruction which makes dirty items in the cache memory associated with said at least one processor visible to the 
other processors. 

12. A computer system according to any preceding claim, wherein the cache coherency instructions include a purge 
instruction which removes items from the cache memory. 

13. A computer system according to any preceding claim, wherein the cache coherency instructions include a coherent 
instruction which makes dirty items in the cache memory associated with said at least one processor visible to the 
other processors and ensures that stale items are not read from the cache memory. 

14. A computer system according to any preceding claim, wherein the cache coherency instructions include a validate 
instruction which ensures that stale items are not read from the cache memory. 

15. A method of modifying the coherency status of the contents of a cache in a computer system comprising a plurality 
of processors each having an execution unit for executing a sequence of instructions and at least one of the 
processors having associated therewith a cache memory, and storage circuitry having addressable storage loca- 
tions in the memory address space of said at least one processor in which items are stored for use by the proc- 
essors, the method comprising: 

defining for each item a cacheable behaviour of the item, wherein the cacheable behaviours include a software 
coherent behaviour and an automatically coherent behaviour; 

holding in a behaviour store in association with an address for each item a cache behaviour identifying the 
cacheable behaviour of the item; 

executing in the execution unit cache coherency instructions which each specify an operation to be executed 
on the contents of a cache location in the cache memory and an address in the storage circuitry; 
executing an operation on the contents of the specified cache location and effecting a cache coherency op- 
eration to render the contents coherent with the storage circuitry in a manner dependent on the cacheable 
behaviour of the specified address and whether or not the processor contains a cache coherency unit for 
automatically implementing coherency. 

16. A method according to claim 15, wherein the computer system comprises a cache coherency unit for automatically
implementing coherency and wherein said cache coherency instructions effect a cache coherency operation for
automatically coherent and software coherent behaviour dependent on the nature of the cache coherency unit.

17. A method according to claim 15, wherein the computer system has no cache coherency unit for automatically
implementing coherency, wherein items having an automatically coherent behaviour are denied access to the
cache memory.

18. A method according to claim 15, 16 or 17, wherein each item in the or each cache memory is stored in association
with a valid bit which indicates whether or not that item is valid.

19. A method according to any of claims 15 to 18, wherein each item in the or each cache memory is stored in
association with a dirty bit which indicates whether or not that item has been modified with respect to the storage circuitry.

20. A method according to any of claims 15 to 19, wherein the cacheable behaviours include an unshared behaviour
for items unique to one of said processors.

21. A method according to any of claims 15 to 20, wherein the cacheable behaviours include an uncacheable behaviour
for accesses to memory devices other than said storage circuitry. 

22. A method according to any of claims 15 to 21, wherein the cacheable behaviours include a device behaviour for
accesses to devices other than memory devices, said devices being addressable in the memory address space 
of the processors. 

23. A method according to any of claims 15 to 22, wherein the cache coherency instructions include a flush instruction
which makes dirty items in the cache memory associated with said at least one processor visible to the other 
processors. 

24. A method according to any of claims 15 to 23, wherein the cache coherency instructions include a purge instruction
which removes items from the cache memory. 


25. A method according to any of claims 15 to 24, wherein the cache coherency instructions include a validate
instruction which ensures that stale items are not read from the cache memory.